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Abstract 

We report scaling results on the world's largest supercomputer of our recently 
developed Billions-Body Molecular Dynamics (BBMD) package, which was es- 
pecially designed for massively parallel simulations of the atomic dynamics in 
structural glasses and amorphous materials. The code was able to scale up to 72 
racks of an IBM BlueGene/P, with a measured 89% efficiency for a system with 
100 billion particles. The code speed, with less than 0.14 seconds per iteration 
in the case of 1 billion particles, paves the way to the study of billion-body 
structural glasses with a resolution increase of two orders of magnitude with 
respect to the largest simulation ever reported. We demonstrate the effective- 
ness of our code by studying the liquid-glass transition of an exceptionally large 
system made by a binary mixture of 1 billion particles. 

Keywords: 

supercooled liquids, large scale parallel computing, molecular dynamics, 
thermodynamics of glasses 



Email address: andrea.fratalocchi@kaust.edu.sa (Andrea Fratalocchi 3 ) 
URL: www.primalight.org (Andrea Fratalocchi 3 ) 



Preprint submitted to Journal of Computational Physics 



January 26, 2013 



1. Introduction 

Recently, contributing to the 2009 annual survey of the EDGE foundation 
[1] entitled "What will change everything" 0], T. Sejnowski foresees that the 
computers will be the "microscopes of the future" and, specifically, that are com- 
puters that "have made it possible to localize single molecules with nanometer 
precision and image the extraordinary complex molecular organization inside 
cells" . In his contribution Sejnowski was recognizing the importance of auto- 
matic and computerized control of laser beam and image analysis in modern 
optical microscopy. However, more importantly, computers can be also consid- 
ered as " future microscopes" for the capability of simulating at the atomic scale 
the behavior of matter and biological systems. To reach this goal, which was 
even unthinkable up to some decades ago, it is necessary to develop software 
able to simulate the newtonian dynamics of a large number of atoms (order 
10 12 , the "number" of atoms in a cell) for long enough time (order microsecond, 
or 10 10 time steps) [H, The state-of-the art supercomputers, on the hard- 

ware side, and the parallel simulation codes, on the software one, are nowadays 
on the way to get this result. However, before facing the ultimate problem of 
simulating the complexity of life (micro-) organisms, one should validate and 
optimize codes that simulate hard (physical and chemical) systems. Even if 
smaller than a micro-organism, these systems encompass problems which are 
extremely demanding in terms of computational resources, with e.g., simulation 
boxes containing pico-molar quantity of matter and with characteristic times 
of 0.1 microseconds. Molecular Dynamics (MD) Q is expected to be the key 
to efficiently solve these type of problems. As each generation of computers is 
introduced, in fact, larger and longer simulations are allowed to be run, thus 
producing better results which answer a variety of new scientific questions. The 
latest generation of terascale and petascale supercomputers (see e.g., 0, 0, S|), 
in particular, holds the promise to enable the development of more realistic 
and complex interactions, as well as the study of systems made by a very large 
number (« 10 10 ) of particles. In the most common paradigm used today, called 
massively parallel processing, the typical size of computer clusters ranges from 
several thousand to hundreds of thousands of processing core units. In these 
architectures, a high degree of parallelization is essential to efficiently utilize the 
computational power available. While in principle a high level parallelization 
strategy works fine for small to medium size supercomputer clusters, tuning for 
specific architectures can be the key to achieve huge scaling performance. The 
design concepts of today's processors, in fact, are markedly different from one 
system to another, and it is necessary to prepare codes having specific architec- 
tures in mind in order to optimize both the speed and the bandwidth of memory 
access, which is typically slow if compared to the processor's frequency. 
With reference to molecular dynamics, several methods have been published in 
the past which incorporate different degrees of parallelism Q. To date, MD 
scaling has been demonstrated up to ten thousand of cores, with a speed of 
about 7 iterations per second for an ensemble of 1 billion particles running on 
65536 cores of a BlueGene/L Q. For several applications of molecular dynamics, 
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such as the study of structural glasses [l(| where a typical simulation requires 
10 6 ~ 7 time steps, this speedup is still too small to perform practical calculations. 
In the study of glasses, and in the more general area of amorphous materials, 
molecular dynamics simulations are extremely important as they are able to 
glimpse the system dynamics at spatial scales between 1 — 100 nm, a range that 
is completely inaccessible with experimental apparatus ll|,|l2j. However, due to 
such speed bottleneck problems, present MD studies on glass have been limited 
to a number of particles (10 million [13]) unable to have simulation boxes large 
enough to capture the interesting phenomenology at the micron-scale. There- 
fore, if on one hand it is necessary to improve the scaling of general MD codes at 
least two orders of magnitude, in order to effectively use the computing power 
available, on the other it is also important to focus on specific research fields, 
such as the context of amorphous materials, which are now asking for new levels 
of performances. 

In this article we report the results of our recently developed Billions Body 
Molecular Dynamics (BBMD) package, and demonstrate its effectiveness in the 
study (as case study) of structural glasses by analyzing the glass formation of 
an exceptionally- large particles system. The BBMD code was able to scale on 
the whole 294912 cores of the BlueGene/P system at the Jiilich Supercomputing 
Centre, which constitutes the world's largest supercomputer available charac- 
terized by 72 racks of an IBM BlucGcne/P. In such an extreme scaling test, 
the BBMD code showed an efficiency of about 90%, and a overall speed of 2 
seconds x iteration with 100 billion particles. These results pave the way to 
the study of very large systems. In order to demonstrate the applicability of 
our code to the field of amorphous materials, we performed a controlled tem- 
perature MD simulation of a system made by 1 billion particles, and studied 
the supercooled dynamics of the liquid state by varying the temperature in the 
range T £ [2, 10~ 4 ]. This simulation has been performed on the Shaheen su- 
percomputer, hosted at KAUST University and constituted by 16 racks of an 
IBM BlueGene/P. This paper is organized as follows. We begin our analysis by 
discussing the structure of the BBMD code (Sec. [5]), reviewing both paralleliza- 
tions and optimization strategies. In Sec. [3J we describe BBMD scaling results 
obtained at the Jiilich supercomputing center. We report the code speedup for 
molecular systems of different size, ranging from 1 billion to 100 billion particles. 
Communication workload versus calculation execution time is also studied. In 
Sec. EJ finally, we investigate the glassy dynamics of a 1 billion particles system 
made by a binary mixture of soft-spheres. 



2. The BBMD code 

The BBMD code is a highly optimized, parallel C++ MD code for Lennard- 
Jones particles systems, designed to scale on machines characterized by hundreds 
of thousands of processors, such as the latest generation of IBM BlucGene/P 
supercomputers. Much effort was taken to balance design simplicity and code 
speed, while optimizing at the same time both memory requirements and cache 
efficiency. In the following, we briefly describe the main structure of the code. 
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Spatial domain 




Figure 1: Main structure of the BBMD parallel code: (right) the simulation domain is spatially 
decomposed into rectangular boxes, each defining a single MPI process (center) that, in turn, 
is constituted by several cubic MD cells of side r c (left). Each MD cell contains a bidirectional 
list with all particle informations: position x, velocity v, acceleration a, mass m and species 
s. 



2.1. Interaction potential 

BBMD is originally designed to support two main classes of short range in- 
teraction potentials: 



• The Lennard- Jones potential 

V(nj) = 4e Q/3 



r / \ 12 













(1) 



• The soft-sphere potential 



'in 



(2) 



being a and e two tensors that parametrize the potentials, and the distance 
between particles i,j of species a and (3, respectively. Since Eqs. (P)-© con- 
verge rapidly to zero beyond w | |cr| | , it is wasteful to consider an interaction 
between two particles at a long distance. A standard choice in MD is therefore 
to truncate the potentials beyond the distance r c , in order to increase calcula- 
tion's speed. To avoid any problem due to the discontinuity of V(r) at r = r c , 
we replace Eqs. CO)-© wrtn the following potential: 



V*(r) 



V(r) - V(r c ) - (r - r c ) 



dV(r) 



[1 - 6(r c )] 



(3) 



with 0(x) being the Heaviside function. The modification © applies across the 
entire interaction range, and the overall spatial domain is decomposed into MD 
cubic cells of r c x r c x r c volume (Fig. [T]). 
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2.2. Grid-search method 

To perform the time evolution of the system, forces between particles have 
to be calculated and particle pairs whose distance is below the cutoff range r c 
have to be found. For this task, we adopt an O(N) linked-list method, with an 
inexclusive Q grid that allows more than one particle to occupy a single cell. 
Newton's third law is then applied in order to halve the number of neighbors 
to be checked in the calculation of the forces. In order to guarantee optimal 
memory efficiency, we developed a bidirectional list structure that contains all 
the information of the particles: position x, velocity v, acceleration a, mass m 
and species index s (Fig. [1]). The double pointer mechanism allowed an efficient 
implementation of memory-related operations (e.g., movement of a particles on 
a different cell or on a different processor) without deleting or creating memory 
locations but just by moving pointers, which is very fast. Another advantage 
of such memory structure is that it automatically links together particles which 
are close in space, thus improving speed efficiency in all search operations. To 
maintain memory requirement as low as possible, we do not employ any book- 
keeping method, such as the neighboring particle list Q, which are too memory 
demanding for billion sized particle systems. 

2.3. Parallelization 

2.3.1. Parallelization scheme 

BBMD has been parallelized with the domain decomposition strategy, also 
known as spatial decomposition, where the simulation box is divided into sub- 
domains and each domain is assigned to each processor (Fig. [1]). The Message 
Passing Interface (MPI) standard is then employed to handle all parallel com- 
munications among processes. This type of parallelization has the advantage 
to require only nearest-neighbor communications, with a few limited global col- 
lective operations, and it is therefore well suited for very large supercomputer 
clusters such as the IBM BlueGene series. 

2.3.2. Parallel force calculation 

BBMD employs the velocity- ver let time marching algorithm Q, which is a 
standard in MD algorithms due to its robustness and accuracy in maintaining 
conserved quantities, such as the energy and the momentum. In this evolution 
scheme, the hot spot of the algorithm is the computation of the forces exerted 
among particles. To perform this calculation, two parallel operations are re- 
quired: (i) moving particles that stray from the originally associated process 
and (ii) exchanging particles that belong to borders crossing different proces- 
sors. In BBMD, particular care has been taken in the design of (i) and (ii) 
in order to overlap communications and calculations to the maximum extent 
possible, while optimizing speed and cache efficiency. More specifically, the par- 
allel communication starts with the operation (i) and then proceeds with (ii). 
In both steps, a one dimensional array containing particles properties (i.e., x, 
v, a, m, s) needs to be constructed, and the number of particles to be sent to 
neighboring processors has to be calculated. In BBMD, these two operations are 
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Figure 2: BBMD strong scaling results: code speed versus number of processors. 

overlapped with send/recv MPI communication routines in order to minimize 
communication time with respect to calculation. The task (i) begins by tagging 
all the particles that belongs to different processors, and sorting them on-the-fly 
thus minimizing access times in subsequent MPI communications. This is done 
by exploiting the bidirectional nature of the particle list. In particular, each 
particle that needs to be sent to a different process is first moved on the tail 
of its cell list with an O(l) operation, and then tagged by inverting its mass 
sign. Such tagging procedure avoids the use of an external index and increases 
the memory efficiency of the code. Besides that, this method automatically 
groups tagged particles together, and allows to access them sequentially with 
O(l) operations when MPI send/recv communications are performed. Once the 
task (i) has been completed, the exchanging of the nearest-neighbor is done 
with the standard ghost-cell approach In both (i) and (ii) tasks, one single 
MPI process is required to communicate with his 26 neighbors. Although being 
characterized by nearest neighbors communication only, a naive implementa- 
tion of this algorithm requires 26 different communications and results in poor 
scaling performances when a large number of processor is employed. However, 
by taking advantage of the domain decomposition strategy employed in BBMD. 
it is possible to reduce the number of total communications to just 6. This is 
achieved by properly enlarging the communication window, and in particular 
by exchanging part of the ghost cells during each MPI communication (see e.g., 
[5[ for more details). 
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2.3.3. BlueGene specific optimizations 

In the calculation of the sqrt function, required by the computation of Eq. 
([3]), we employ the square root reciprocal BlueGene function frsqrte, coupled 
with two Newton-Rapson iterations. More specifically, we replaced the code 
segment: 

ri j=sqrt (r2) ; 

with: 

rij = frsqrte(r2); 

rij = ((0.5 * rij) * (3.0 - r2 * (rij * rij))); 
rij = ((0.5 * rij) * (3.0 - r2 * (rij * rij))); 
rij = rij * r2; 

This optimization results in a speed increment of about 7%. 

3. BBMD scaling results 

The evaluation of BBMD code performances has been carried out on the 
Jugene system at the Jiilich Supercomputing Center Q, which is composed of 
294912 cores (or 72 racks) of an IBM BlucGcnc/P with total peak performance 
of 1 PFlops. The test suite consisted of a series of canonical molecular dynamics 
simulations of a 20 : 80 binary mixture of soft-spheres with the following param- 
eters (here all the units are to be intended as normalized MD units [j|): mi = 1, 
(Tn = 1, 0"i2 = 0.8 , 022 = 0.88, en = 1, ei2 = 1.5, €22 = 0.5. In the canonical 
evolution, the temperature T has been fixed to T = 0.5, and the density p to 
p = 1.2 in order to maintain the system in the liquid state with all the parti- 
cles randomly fluctuating among different processors. Figure [2] displays BBMD 
strong scaling results for systems characterized by 1, 10 and 100 billion particles. 
Although the memory footprint of BBMD permits the number of particles to be 
increased by up to two orders of magnitude, we concentrated on a range where 
the code speed was sufficiently fast to implement realistic calculations. In the 
smallest case of a system with 1 billion of particles, BBMD was able to scale well 
beyond 10 racks and obtained and efficiency of 68% on 16 racks when compared 
to a single rack. When the number of particles was increased from 1 billion to 
10 billion, 70% of efficiency was achieved between 1 rack and 64 racks. This 
number was improved even further for systems containing 100 billion particles. 
In this configuration, BBMD reached an efficiency of 89% on 72 racks (or 294912 
cores) when compared to 8 racks. 
To investigate the weak scaling performances of the code, we perform a series 
of runs with a fixed number of cores and with system sizes varying between 
N = 1 billion and N — 100 billion. Figure [3] shows the result of such analysis. 
When the particles are increased up to three orders of magnitude, the BBMD 
code shows a perfect linear O(N) complexity. The improved scaling is due 
to the proportion of computation with respect to communication time, which 
appreciably increases when the number of particles grow (Fig. |4]). 
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Figure 3: BBMD weak scaling results: code speed versus number of particles. 
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As seen in Fig. [H the communication impact over computation is practically 
negligible and the overall execution time is dominated by calculations. This 
is not only the result of the three dimensional spatial domain decomposition 
employed, which significantly reduces the volume of each MPI process and the 
number of elements to be exchanged during each iteration, but also relies on 
the specific overlapping strategy used in the parallel calculation of the forces 
(Section [2X2]), which minimizes MPLWaitall times. 

In the case of a system with 1 billion of particles, we measured a code speed 
of 0.14 seconds x iteration, which is sufficiently small to enable the study of 
billion-body structural glasses. Such a system size is two orders of magnitude 
higher than the largest glass simulation ever reported, and will improve up to 
two orders of magnitude the resolution of MD measurements on glassy dynamics 




4. A glass transition in a 1 billion binary mixture of soft spheres 

4-.1. A few words about the glass phase 

When the temperature of a liquid is reduced below the melting tempera- 
ture, two possible physical phenomena may occur: either the system undergoes 
crystallization -a first order phase transition- where the final ordered configu- 
ration is the thermodynamically stable phase, or else the liquid may become 
supercooled and get dynamically arrested into a disordered solid represented by 
a glass. Despite a large scientific production, the phenomenology of the glass 
state is still far from being completely understood, and many aspects are yet 
unknown. As an example, it is not yet clear whether the dynamical arrest is 
a genuine thermodynamic phase transition, a kinetic phase transition er some- 
thing else, as the real world counterpart of a phase transition taking place in the 



trajectories' phase space as recently suggested by Chandler and co-workers [14 1. 
In the last decades, in particular, several different theories have been suggested 
and various diverse analysis have been pursued HI 



Notwithstanding the wide diversity of views, a consensus exists over the consid- 
eration that a glass transition is a dynamical crossover through which a viscous 
liquid falls out from equilibrium and becomes solid on the experimental time 
scale. Such a process is manifested in a gradual change in the slope of the 
volume (or other extensive thermodynamic variables such as the entropy or the 
entalphy) at a specific temperature T gi which defines the glass-transition tem- 
perature. A fundamental property of glasses is the existence of a high degree of 
frustration in their ground state energy configuration 19]. Frustration, in turn, 
is manifested by the existence a huge number of minima of equivalent energy 
(metastable states). When the temperature decreases below a specific thresh- 
old, the energy barriers of the various minima becomes sufficiently large to trap 
the dynamics in the configuration space and let it explore only a subset of the 
available iso-energy surface. The result of this dynamical arrest is a disordered 



solid that defines the glass phase [18 1 
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Figure 5: Temperature T curve as a function of the time step employed in the study of the 
liquid-glass transition in a 1 billion particles system. 




temperature T temperature T 

Figure 6: Caloric curves U(T) obtained from MD simulations (circle markers) and relative 
least square fit (solid and dashed lines). Figure (b) is an enlarged section of (a) near the glass 
temperature T g . 
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4-2. Sample setup and simulation results 

It is worth to mention that on cooling a liquid, if crystallization is avoided, 
the atomic dynamics is characterized by a relaxation time (naively, one can 
thing to the typical time needed to change the inherent "configuration") that 
increase following an Arrhenius law in strong glass forming systems, and even 
faster in fragile glass forming systems 2(J 21 1, reaching the value of 100 s at 
the glass transition temperature. A supercooled liquid above, but close to, the 
glass transition temperature can be considered at equilibrium only if its atomic 
dynamics is investigated for log time, longer that the relaxation time, a macro- 
scopic time. As a consequence, during a MD run, necessarily lasting for time 
much shorter than the relaxation times close to the glass transition, one falls 
out of equilibrium as soon as the temperature became that temperature where 
the relaxation time correspond to the simulation times. As a consequence, in a 
MD simulation the temperature where the system becomes trapped in a specific 
inherent structure is largely higher than the real glass transition temperature. 
In a molecular dynamics simulation, a liquid-glass transition can be observed 
as a continuous transition of the extensive thermodynamic parameter as energy 
or specific volume. 

To study this transition, we employed the same binary mixture used for our 
benchmark suite in Section [3] the high density p = 1.2 and the 20 : 80 highly 
asymmetric mixture configuration, in fact, neglects the existence of a well de- 
fined solid phase for the system and the low temperature ground state becomes 
highly frustrated [22[ • We therefore expect the observation of a glass transition 
in the dynamics as soon as the temperature is decreased to a small enough value. 
Figure [5] shows the temperature curve employed for our MD simulation. At the 
beginning, we rapidly increase the system temperature to a high value T = 3. 
In this state, the two particles species have sufficient kinetic energy to freely dif- 
fuse over the whole simulation box. After 200000 time steps, the temperature 
starts to decrease by following a slowly-varying linear curve, whose duration is 
of 3 • 10 6 time steps. In order to guarantee stability and energy conservation over 
the whole simulation, we adopt a time t resolution of St — 10 -3 s. The cutoff 
range has been set to r c = 1.2, which guarantees a sufficiently large interaction 
distance to observe the formation of a supercooled liquid. Figure [5] displays the 
caloric curve of the system, and shows the behavior of the potential energy U 
versus the temperature T. For temperature values well above T g = 0.2, the 
system is in the liquid state and follows a continuous power law curve with 
T = T* , as found from a nonlinear least square fit procedure applied on the 
energy U and as expected from the theory of Tarazona [23} . As we progressively 
reduce the temperature beyond the value T g , we observe the appearance of a 
continuous transition, characterized by a radically change in the derivative of 
U with respect to T. For temperature values below T g , which defines the glass 
temperature [l(| , the energy U varies linearly with the temperature and the sys- 
tem gets trapped into an arrested phase with all the particle oscillating almost 
harmonically around their equilibrium configuration. Figure [7] displays a 16000 
particles portion of the billion-body solid formed at T — 10~ 3 . As expected from 
the thermodynamic analysis based on the caloric curve, the solid configuration 
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Figure 7: A 16000 particles portion of the resulting glass at T = 10 3 . The black particles 
are of species 1, the gray belongs to species 2. 

reached is that of a structural glass, whose particles are randomly arranged in 
space. The structural properties of the glass are analyzed by calculating the 
radial distribution function <?(r), defined as: 



being V the sample volume, N the total number of particles, the distance 
between particles i and j and < ... > denoting an ensemble average. The 
radial distribution function describes the spherically averaged local organization 
around a specific atom, and it measures the probability of finding an atom at a 
distance r from a given particle. Figure [8] displays the g(r) for the billion-body 
glass obtained at T = 0.001. The glass is characterized by a sharp peak at r « 1, 
which yields the average minimum interparticle distance at equilibrium, followed 
by two broader peaks and a series of oscillations of decreasing amplitude around 
the asymptotic value g(r — > oc) = 1. The presence of well defined broad peaks 
in the radial distribution function is the hallmark of a structurally disordered 
phase, with particles oscillating around randomly arranged spatial sites. This 
dynamics is smeared out as long as the interparticle distance increases, and all 
correlations are dramatically reduced after r ss 4, thus indicating the lack of any 
long range positional order in the system. The g(r) reported in Fig. [5] compares 
favorably with previous determination in soft sphere systems. 





12 



6 
5 
4 

Is 

2 
1 

°0 1 2 3 4 
atom distance r 

Figure 8: The radial distribution function g(r) of the glass at T = 10 — 3 . 

5. Conclusions 

We have developed a parallel code which demonstrates the applicability of 
molecular dynamics techniques to the study of billions-body structural glasses. 
We tested our code on the world's largest supercomputer available, namely the 
Jugene BlueGene system at the Jiilich Supecomputing center, and demonstrated 
scalability on the full machine [characterized by 294912 computing cores] with 
an efficiency of 89% in the largest configuration of 100 billions of particles. We 
then applied our code to the case study of the supercooled dynamics of an 
exceptionally large system, constituted by an highly frustrated binary mixture 
of one billion of particles. This simulation paves the way to the computational 
study of billions-body structural glasses, thus achieving new levels of resolution 
in the analysis of anomalous vibration of amorphous materials and, on broader 
perspective, of living matter. 
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